Abstract. The knowledge discovery potential of the new large
astronomical databases is vast. When these are used in conjunction with the
rich legacy data archives, the opportunities for scientific discovery mul- 
tiply rapidly. A Virtual Observatory (VO) framework will enable trans- 
parent and efficient access, search, retrieval, and visualization of data 
across multiple data repositories, which are generally heterogeneous and 
distributed. Aspects of data mining that apply to a variety of science user 
scenarios with a VO are reviewed. The development of a VO should ad- 
dress the data mining needs of various astronomical research constituen- 
cies. By way of example, two user scenarios are presented which invoke 
applications and linkages of data across the catalog and image domains in 
order to address specific astrophysics research problems. These illustrate 
a subset of the desired capabilities and power of the VO, and as such they 
represent potential components of a VO Design Reference Mission. 



1. Science Requirements for Data Mining 

One of the major functions of a Virtual Observatory (VO) is to facilitate data 
mining and knowledge discovery within the very large astronomical databases 
that are now coming on-line (or soon will be). A similarly important function 
of the VO is to facilitate linkages and cross-archive investigations utilizing these 
new data in conjunction with the rich legacy data archives that preceded them. 
The scientific teams that generate large (multi-Terabyte) databases cannot begin
to tap their full scientific potential. Thus a significant portion of the astronom- 
ical research community and a comprehensive suite of research tools should be 
brought to bear on extracting the maximum scientific return for the huge in- 
vestment in these large astronomical facilities, large surveys, and large scientific 
data systems. One approach to this problem can be identified as "data mining". 

What is data mining, and why is it applicable to scientific research? Data
mining is defined as an information extraction activity whose goal is to discover
hidden facts contained in databases. Data mining has taken the business community
by storm, and the phrase has become a bit overworked to describe some
fairly routine functions of marketing. Even so, there is consequently now a
vast array of resources and research techniques available for exploitation by the
scientific communities. It is useful therefore to examine a further categorization
of data mining thrusts and their sub-components, since these are likewise
applicable to the scientific exploration of large astronomical databases.

In the marketing community, data mining is used to find patterns and
relationships in data by using sophisticated techniques to build models, that is,
abstract representations of reality. A good model is a useful guide to understanding that
reality and to making decisions. There are two main types of data mining mod- 
els: descriptive and predictive. Descriptive models describe patterns in data and 
are generally used to create meaningful subgroups or clusters. Predictive models 
are used to forecast explicit values, based upon patterns determined from known 
results. These models are applicable to scientific inquiry as well. 

There is another differentiation of data mining into two categories that 
we find particularly appropriate to knowledge discovery in large astronomical 
databases: event-based mining and relationship-based mining. At the risk of 
trivializing some fairly sophisticated techniques, we classify event-based mining 
scenarios into four orthogonal categories: 

• Known events / known algorithms: use existing physical models (descriptive
models) to locate known phenomena of interest either spatially
or temporally within a large database.

• Known events / unknown algorithms: use pattern recognition and clustering
properties of data to discover new observational (in our case, astrophysical)
relationships among known phenomena.

• Unknown events / known algorithms: use expected physical relationships
(predictive models) among observational parameters of astrophysical
phenomena to predict the presence of previously unseen events within a
large complex database.

• Unknown events / unknown algorithms: use thresholds to identify transient
or otherwise unique ("one-of-a-kind") events and therefore to discover
new phenomena.
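
To make the last category concrete, the following is a minimal sketch of threshold-based event detection on a single time series. It is an illustration only: the function name `flag_outliers`, the 5-sigma default, and the sample flux values are assumptions, not part of any survey pipeline. The threshold is derived from the median and the median absolute deviation so that a single bright event does not inflate the noise estimate used to detect it.

```python
from statistics import median


def flag_outliers(values, n_sigma=5.0):
    """Return indices of points that deviate from the median by more than
    n_sigma robust standard deviations (hypothetical helper)."""
    med = median(values)
    # Median absolute deviation: a noise estimate that is insensitive to outliers.
    mad = median(abs(v - med) for v in values)
    # The factor 1.4826 converts the MAD to a standard deviation for Gaussian noise.
    sigma = 1.4826 * mad
    if sigma == 0:
        return []  # a perfectly flat series has no usable threshold
    return [i for i, v in enumerate(values) if abs(v - med) > n_sigma * sigma]


# Made-up light curve: roughly constant flux with one transient brightening.
flux = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 25.0, 10.0, 9.9, 10.1]
print(flag_outliers(flux))  # [6]
```

A real search would apply such a test per object and per band, and would need to account for known variable sources before any flagged event could be called new.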

Similarly, for relationship-based mining, we identify three classes of association- 
driven scenarios that would find application in astronomical research: 

• Spatial associations: identify events (astronomical objects) at the same
location in the sky.

• Temporal associations: identify events occurring during the same or
related periods of time.

• Coincidence associations: use clustering techniques to identify events
that are reasonably co-located within a multi-dimensional parameter space.
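
As an illustration of a temporal association, the sketch below pairs events from two lists whose time intervals overlap or fall within a chosen gap of each other. The function name `overlapping_events`, the `max_gap` parameter, and the event names and times are all hypothetical; times are assumed to be in a common system (for example, days) in both lists.

```python
def overlapping_events(events_a, events_b, max_gap=0.0):
    """Pair events whose time intervals overlap, or are separated by at most
    max_gap. Each event is a (name, start, end) tuple (hypothetical layout)."""
    pairs = []
    for name_a, start_a, end_a in events_a:
        for name_b, start_b, end_b in events_b:
            # Negative or zero gap means the intervals overlap;
            # a positive gap is the time between one ending and the other starting.
            gap = max(start_a, start_b) - min(end_a, end_b)
            if gap <= max_gap:
                pairs.append((name_a, name_b))
    return pairs


# Made-up events from two hypothetical observation logs (times in days).
bursts = [("burst-1", 100.0, 100.1), ("burst-2", 250.0, 250.2)]
flares = [("flare-x", 100.05, 100.5), ("flare-y", 251.0, 251.5), ("flare-z", 400.0, 401.0)]

print(overlapping_events(bursts, flares))               # strict overlap only
print(overlapping_events(bursts, flares, max_gap=1.0))  # allow a one-day gap
```

The double loop is quadratic in the number of events, which is acceptable for a sketch; large logs would call for sorting by start time first.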

Therefore, from this discussion, we can derive a reduced set of science 
requirements for data mining. These requirements correspond to the following
set of exploratory approaches to mining large databases: Object Cross-Identification,
Object Cross-Correlation, Nearest-Neighbor Identification, and Systematic Data
Exploration. (a) "Object cross-identification" refers to the classical problem of
connecting the source list in one catalog (or observation database) to the source
list in another, in order to derive new astrophysical understanding of the
cross-identified objects (e.g., gamma-ray burst counterparts).
(b) "Object cross-correlation" refers to the application of "what if" scenarios to
the full suite of parameters in a database (e.g., identifying distant galaxies as
U-band dropouts in a color-color scatter plot from the HDF survey). (c) "Nearest-neighbor
identification" refers to the general application of clustering algorithms
in multi-dimensional parameter space (e.g., finding the closest known population
of young stars, in the TW Hydrae association, through their similar kinematics,
X-ray emission, Hα, and Li abundance). (d) "Systematic data exploration"
refers to the application of the broad range of event-based and relationship-based 
queries to a database in the hope of making a serendipitous discovery of new 
objects or a new class of objects (e.g., finding new types of variable stars, such 
as "bumpers", in the MACHO database). 
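
Approach (c) can be sketched as a nearest-neighbor query in a standardized parameter space. The example below is only an illustration under stated assumptions: the function name `nearest_neighbors`, the choice of three parameters, and all of the numerical values are made up, and plain Euclidean distance on standardized parameters stands in for whatever clustering algorithm a real investigation would use.

```python
import math
from statistics import mean, pstdev


def nearest_neighbors(objects, reference, k=3):
    """Return the names of the k objects closest to `reference` in a
    standardized multi-dimensional parameter space (hypothetical helper)."""
    names = list(objects)
    n_dim = len(objects[reference])
    # Standardize each parameter so that no single unit dominates the distance.
    columns = [[objects[name][d] for name in names] for d in range(n_dim)]
    centers = [mean(col) for col in columns]
    scales = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread

    def scaled(name):
        return [(objects[name][d] - centers[d]) / scales[d] for d in range(n_dim)]

    ref = scaled(reference)
    distance = {name: math.dist(ref, scaled(name)) for name in names if name != reference}
    return sorted(distance, key=distance.get)[:k]


# Made-up parameters per star: (proper motion in RA, proper motion in Dec,
# log of X-ray to bolometric luminosity ratio). Values are illustrative only.
stars = {
    "A": (-66.0, -14.0, -3.0),
    "B": (-70.0, -18.0, -3.1),
    "C": (-62.0, -12.0, -2.9),
    "D": (5.0, 3.0, -6.0),
    "E": (-2.0, 10.0, -5.5),
}
print(nearest_neighbors(stars, "A", k=2))  # ['C', 'B']
```

Objects that share kinematics and activity indicators with the reference star come back first, which is the sense in which members of a co-moving association would stand out from field stars.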



2. User Scenario #1: Estimating the Galaxy Interaction Rate 

It is well established that a significant fraction of all galaxies have been involved 
in a galaxy-galaxy interaction and perhaps a merger at some time(s) in their past. 
The rate of these interactions is not yet well determined empirically: either the 
current rate for galaxies in the nearby Universe, or the cosmologically evolving 
rate in the distant Universe. Numerical simulations of the galaxy population and 
of the evolving hierarchical structure within various cosmological scenarios give a 
handle on the interaction and merger rates, which naturally depend on the choice 
of cosmological model. In general, the simulations confirm the importance and 
relatively high frequency of occurrence of interactions and mergers. Given the 
cosmological significance of interactions to galaxy formation and evolution, it is 
important to derive a firm value for the galaxy interaction rate observationally, 
for comparison with the numerical models, which will in turn help to narrow the 
plausible range of cosmological models, galaxy formation models, and galaxy 
evolution models. 

We attempted an initial exploration of several on-line databases in order to 
estimate the galaxy interaction rate. We began by exploring an on-line catalog 
of galaxies (available through NASA's Astronomical Data Center, ADC): the
Updated Zwicky Catalog of Falco et al. (1999). This catalog identifies multiple- 
galaxy groupings, which we used to reduce the full list of 19,000 galaxies to the 
set of 1800 multiples. We then selected a very small sub-sample from this list to 
conduct a proof-of-concept investigation. We used existing catalog visualization 
tools and archive linkage tools at the ADC to find all possible NASA mission 
data and most of the all-sky survey data for these selected objects (Kargatis
et al. 1999). We then identified characteristics in the optical images or in the
IRAS fluxes or in the X-ray emissions to verify that the associated multiple 
galaxy systems are in fact (to high probability) bound groups (pairs, triples, 
quartets, etc.). The expectation that these small galaxy-galaxy separations and 
other evidence of physical association do in fact imply an on-going interaction
was often confirmed through inspection of the DSS (Digital Sky Survey) imagery, 
which showed signs of interaction in many cases (e.g., distorted morphologies). 
Thus, by applying knowledge of astrophysical signatures of interactions, we were 
able to explore multiple databases (ADC catalogs, NASA mission archives, and 
ground-based sky surveys) in a coherent organized manner. We estimate that 
the galaxy interaction rate in the local Universe is approximately 8%. 
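
The numbers quoted above allow a rough consistency check, sketched below. This is not the derivation behind the 8% estimate: it assumes, purely for illustration, that the 1800 multiples and the 19,000 catalog entries are both counts of galaxies, and that the quoted rate is a fraction of all catalog galaxies. The variable names are hypothetical.

```python
n_catalog = 19_000    # galaxies in the full catalog (from the text)
n_multiples = 1_800   # entries flagged as multiples (from the text)
quoted_rate = 0.08    # interaction rate estimated in the text

# Fraction of the catalog that lies in multiple-galaxy groupings.
multiple_fraction = n_multiples / n_catalog

# Under the assumptions above, the share of multiples that would need to be
# genuinely interacting to reproduce the quoted rate.
implied_share = quoted_rate / multiple_fraction

print(f"{multiple_fraction:.1%}")  # 9.5%
print(f"{implied_share:.0%}")      # 84%
```

Under these assumptions the multiples set an upper bound of about 9.5%, so an estimate of approximately 8% corresponds to most, but not all, of the catalogued multiples being physically interacting systems.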



3. User Scenario #2: In Search of the CIB

Among the exciting results of the COBE mission was the discovery by the 
DIRBE team of an extragalactic CIB (Cosmic Infrared Background; Hauser 
et al. 1998). There has been a storm of activity in the research community to
identify the sources of the CIB and to understand their power sources (i.e., what 
powers the strong IR emissions? are they dust-enshrouded quasars? or dusty 
starbursts? or both?). Some possible counterparts to the CIB include: (a) ul- 
traluminous IR galaxies (ULIRGs; see Sanders & Mirabel 1996 for a review); 
(b) SCUBA submm sources (Barger et al. 1999; Blain et al. 1999); (c) IR-selected 
AGN (a new population of AGN identified through the 2MASS survey; Beichman 
et al. 1998); or (d) Extremely Red Objects (EROs), which have been reported 
in several recent deep surveys (Smail et al. 1999; Thompson et al. 1999). 

We initiated a proof-of-concept search scenario for identifying potential
candidate contributors to the CIB. Our approach is similar to that of Haarsma &
Partridge (1998), except that we are applying the full power of on-line databases 
and linkages between these databases, archives, and published literature. Our 
search scenario involved finding object cross-identifications among the IRAS 
Faint Source Catalog and FIRST survey catalog, and then attempting to find 
those commonly identified objects also within other databases, such as the HST 
observation log. In a very limited sample of targets that we investigated to test 
our "ADC as a mini-NVO" approach to the problem, we did find one object 
in common among the HST-IRAS-FIRST databases: a known hyperluminous 
infrared galaxy (HyLIRG) at z=0.780 harboring an AGN, which was specifically 
imaged by HST because of its known HyLIRG characteristics. In this extremely 
limited test scenario, we did in fact find what we were searching for: a distant 
IR-luminous galaxy that is a likely contributor to the CIB, or else similar in 
characteristics to the more distant objects that likely comprise the CIB. 
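
The chain of cross-identifications in this scenario can be sketched as a positional match between catalogs. The code below is a minimal brute-force illustration, not the ADC tools actually used: the function names, the match radii (a wide radius for the infrared-to-radio step, reflecting that IRAS positions are much coarser than radio interferometric ones, and a narrow radius for the final step), and all source names and coordinates are assumptions.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula) between two sky positions
    given in degrees; the result is in arcseconds."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600


def cross_identify(catalog_a, catalog_b, radius_arcsec):
    """Return (name_a, name_b) pairs closer than radius_arcsec.
    Brute force, O(N*M): fine for a sketch, not for full catalogs."""
    matches = []
    for name_a, (ra_a, dec_a) in catalog_a.items():
        for name_b, (ra_b, dec_b) in catalog_b.items():
            if angular_separation_arcsec(ra_a, dec_a, ra_b, dec_b) <= radius_arcsec:
                matches.append((name_a, name_b))
    return matches


def common_objects(ir, radio, hst, ir_radius=30.0, hst_radius=2.0):
    """Find objects present in all three lists: match infrared to radio
    first, then look for the radio position in the HST list.
    Both radii are assumed values in arcseconds."""
    triples = []
    for ir_name, radio_name in cross_identify(ir, radio, ir_radius):
        radio_source = {radio_name: radio[radio_name]}
        for _, hst_name in cross_identify(radio_source, hst, hst_radius):
            triples.append((ir_name, radio_name, hst_name))
    return triples


# Made-up source lists: name -> (RA, Dec) in degrees.
ir = {"IR-1": (150.0000, 2.0000), "IR-2": (210.5000, -5.0000)}
radio = {"R-1": (150.0020, 2.0010), "R-2": (30.0000, 40.0000)}
hst = {"H-1": (150.0021, 2.0011), "H-2": (210.5000, -5.0000)}

print(common_objects(ir, radio, hst))  # [('IR-1', 'R-1', 'H-1')]
```

In this sketch "IR-2" coincides with an HST entry but has no radio counterpart, so it is not reported; only sources found in all three lists survive, which mirrors the single common object found in the limited test described above.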

The preliminary results of the investigations described above for User Sce- 
narios #1 and #2 (including screen shots of the user interfaces employed in the 
studies) are presented at the following web site (under the category "How to use 
the ADC for scientific research projects"): 

http://adc.gsfc.nasa.gov/adc/how_to.html